Prosper Loan data provided by Udacity
Prosper.com is a P2P lending marketplace, similarly with LendingClub.com, with the notorious IPO in 2014. Prosper Loan data provided by Udacity was last updated 03/11/2014
## [1] "/Users/tommyly/Documents/Udacity/p4 - Explore and Summarize Data with R"
Let’s checkout the initial structure of the dataset
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## [1] 113937
## [1] 81
## AA A B C D E HR NA NA's
## 5372 14551 15581 18345 14274 9795 6935 0 29084
The majority of loan are non-classify and among those are being rated, “C” rating loan request are the majority.
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 992 5766 7642 12595 9813 12278 10597 12053 6911 4750 1456 29084
Here we see another rating method. Again, most of the loans are not properly rated and rating is pretty concentrated from 4-8 range.
Surprisigly, people with less than 25,000 USD annual income were not the majority of borrowers. I suspect this trend indicate the customer segment Prosper Loan’s going after. It’s possible the main target are future home-owner, car-owner, studebt-debt payment.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The data has a long-tailed right-skewed but as expected. It’s expected the majority of people in U.S have a credit history, and the ratio should be low enough for a secured repayment. Seems like 25% is the threshold for most borrower.
## Debt Consolidation Home Improvement Business
## 58308 7433 7189
## Personal Loan Student Use Auto
## 2395 756 2572
## Baby & Adoption Boat Cosmetic Procedure
## 199 85 91
## Engagement Ring Green Loans Household Expenses
## 217 59 1996
## Large Purchases Medical/Dental Motorcycle
## 876 1522 304
## RV Taxes Vacation
## 52 885 768
## Wedding Loans Other Not Available
## 771 10494 16965
Again, we see not many people’re willing to explain the purpose of loan listing. There’s a surprisingly amount of needs for debt consolidation, more than 50%. Now the data for income range makes perfect sense, as young people are going out to the real world and start to repay their student debt, purchase cars, mortgage their apartment, etc. Therefore, people already been in the workforce for many decades wouldn’t need much Prosper service.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2005 2008 2012 2011 2013 2014
The chart took a plunge in 2009 and exploded the year after and peaked in 2013. Now, this period is coincide with the 2008 Global Financial Crisis and government must have conducted certain action toward credit-typed business. Doing a quick Google search, we now understand the period of October 15, 2008 to July 13, 2009 is Prosper’s QuietSEC Period, from which they are suspended for lending activities upon SEC approval. Prosper relaunched in July 2009, most notable hypothesis is Prosper needed to adjust their scoring algorigthm to ensure tighter interest rate & stricter rating criteria.
Again, we so almost little to no loan activities in 2009 and the extremely high growth in 2013. Probably in 2013, business went back to normal and the demand for credit increased.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Here is another interesting insight, the minimum loan amoutn is $1,000 USD and people tend to borrow in 4 main range ($5000, $10000, $15000, $20000). Although this might be more than enough for them to cover their original need, people tend to check these rounded amount boxes.
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
Among these criteria, probably people would tick in “Employed” box & would not fill-in other details, which also explained why self-employed borrowers are higher than part-time employment.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
Not so many intresting insight here, median income range is at $4667 and mean of $5608 which is quite average for a professional with few years of experience.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.167 5.583 8.006 11.420 62.920 7625
As expected, I see the data is in favor of people have fewer years of employment and as they aging, they do not need much loan as before.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
As an outsider from finance industry, it’s interesting to know the median borrower rate. I’d suspect they have different borrower rate for different term & rating but here we get a median of 18.4%. However, there is a huge notable spike around 31%, which is a very attractive rate for the investors. This is something we can keep in mind while looking for further investigation.
This is an interesting piece of information we would want to investigate further. 36-month term loan is the most sought-after term, if we couple this with the previous common Borrower Rate, a 3 years loan with median 18.4% borrower rate is a very attractive rate for this business.
This chart is quite similar with the previous chat for borrower rate. As someone who’s completely new to the industry, I did not fully comprehend there’s a very linear and strong relationship between borrower rate and lender yield, which unfold later on during bivariate analysis.
##
## FALSE TRUE
## 86123 27814
Out of 113937 loans, about 24.4% are funded individually. This is a very high concentration toward 1 investor. However, if we keep in mind 75% of loan are below $12000. This would make sense.
The dataset is comprised of 88 variables with 113937 observations. The loans cover the period 2005 & 2014. Variables are of classes int, numeric, date, and factor. Although 88 variables seems to be intimidating at first, we can split these variables into 2 main players: the “Borrowers” variables & “Investors” variables.
For the sake of this analysis, let’s split the dataset into 2 different players:
For Borrower, I believe the Prosper Score, Proser Rating (both numeric & alphabet) are the main indicators of a quality of borrowers. Further bivariate analysis would help to learn about these 2.
For Lender, I now understand Lender Yield is the most important factor for investor.
As a marketplace, Prosper would be more interest in Service Fees & Collection Fees but we would not dive into these variables in our analysis.
So far, I have seen some evidences that people take more loans in specific months and the interest rate of these loans are strongly related to the term of the each loan. There is something curious about loan with 37%-38% of interest rate and I have not understood why loan with term of 36 month are so much more dispersed than 60 months, although the median of interest rates on 60 month are slightly greater than 36 months.
Also, I didn’t find nothing so interesting about risk score and income level, two variable that I was expecting more. On the other hand, maybe this variables have more to tell when related to others. I would like to understand better why loans with term of 36-months are more dispersed than 60 months. Also I want to find out why risky loans are classified that way.
No. I was factorizing a few variables but did not create any new one.
Yes, there is few unsual distribution, I was particularly pay attention to the high spike in lender yield & borrower rate and the spike in LoanOriginalAmount that people tend to buy in bulk. I was factorizing data and since categorical variables enter into statistical models differently than continuous variables, storing data as factors insures that the modeling functions will treat such data correctly.
As we can see, with better rating and safe borrower, the borrower rate is having an inveser relationship with the prosper rating.
The lower the rank, the more loans get defaulted. The graph might be mistaken as E and HR have less defaulted loans than others, but also keeps in mind loans were less categorized into that area.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 38400 56000 67300 81900 21000000
For this graph, we explicitly excluded MonthlyIncome more than 9000 and no income as there’re few outliers have monthly income more than 9000 and the chart cannot displayed properly.
Retired category seems to have higher loan original amount than part-time. This part of data is quite interesting but I don’t really know why such event would happen.
## List of 1
## $ axis.text.x:List of 8
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : num 45
## ..$ lineheight: NULL
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## [1] 0.06291678
As much as I want to believe, although there is some sort of relationship between Debt To Income ratio & Borrower Rate, the relationship is not that significant when we do a correlation test. I also did a few back-of-the-envelope graph and shows no significant relationship between BorrowerRate & other factors.
Similar graph with Borrower Rate, nothing worth notice here.
I initially thought there must be better yield for riskier category but it seems like there’s no relationship here.
This chart is a new interesting insight, although the majority of loan are in 36-month term. The Loan original amount is significantly higher for 60 months term. Let’s see if it’s hold true for Lender Yield
Unfortunately, this is not the case. Term Loan have higher Loan Original Amount because it’d need more time to repay those loans.
I want to explore 2 main variables Lender Yield and Borrower Rate and see which other related variables have the most effect on it. So far, the only relationship I found is through the proprietory Prosper Scoring system. Other factor I was trying to compare was not having any particular relationship. #### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? No #### What was the strongest relationship you found? The relationship between (Prosper Rating and Lender Yield) and (Prosper Rating and Borrower Rate) has an inversed relationship. The higher the rating, the lower the borrower rate and lender yield.
Here is how we take a closer look at Lender Yield vs Prosper Rating and how Prosper Rating was influced by Debt to Income Ratio.
This is a closer look for lender yield vs prosper rating. The majority of loans opt-in for 36-month term and the return for 36-month and 60-month are just higher than 12-month, also considering the fact there’re less loan in 12-month term than other term.
Prosper must have optimized their model throughout the year and as we see the borrower throughout the year, the variation between borrower rate is not that significant anymore and we tend to have smaller standard deviation year-over-year. Something worth noticing is the amount of borrowing suddenly decreased in 2013.
Term loan is quite a good indicator whether we have a better Lender Yield or not. Also, we see how three variables Lender Yield, Prosper Rating and Debt To Income Ratio come together and how it affect each order. ### Were there any interesting or surprising interactions between features? The criteria for being an AA borrowers seems to be tighten over the year and a there seems to be a fixed borrower rate in criteria HR and AA. ### OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model. No ——
This basically best summarize the relationship between Lender Yield and Prosper Rating. The higher the risk, the lower the rating, the better the lender yield. We also noticed high rank like AA would not have D/I ration more than 25% and although most borrowers have lower D/I Ratio, there’re still high D/I ratio borrowers and fall in lower ProsperRating. Therefore, we see a somewhat upward triangular shape.
This chart confirmed that although there is higher lending yield for lower rating. The investors should also takes a look at the risk of the loan being defaulted. We see very clear in this chart as the rating get worsen, the higher chance the loan getting defaulted.
This graph shows a very interesting time in U.S credit business history where most of the banks and financial institution gone under strong investigation from the SEC, only to recover and went back to business and growein stronger than ever.
I’d almost considering switching to another dataset as the more I’m looking at, the more I get confused and see no significant correlation between the variables. I was suspecting to see few obvious trend and correlation without much investigation but I was wrong. Although there are more than 30 plots in this report. Lots of charts were never appeared here as I did lots of back-of-the-envelope visualisation but wouldn’t make any good story. In addition to that, I have to go through each variable one by one, try to understand the meaning as I was not an expert in the financial industry.
I then categorize these variables into three different players: Prosper company, lenders & borrowers. Only then the variables makes a lot more sense & I’m trying to make assumption a long the way to explore it. At the end of the project, it did give me a sense of understanding for the lending business.
As I progress through the analysis, I only found Debt to Income ratio is the predictor of Lending Yield and Borrower Rate and the other variables seems not to play that big of a role. To expand to project, we can try to combine various variables and make a predictive model using logistic regression for a go/no-go investment decision or any classification algorigthm would make sense.